Joudy ALLAM
The Protein Data Bank is a database for the three-dimensional structural data of large biological molecules, such as proteins and nucleic acids. The data, typically obtained by X-ray crystallography, NMR spectroscopy, or cryo-electron microscopy, and submitted by biologists and biochemists from around the world, are freely accessible on the Internet. The purpose of this project is to retrieve from this database, a single or multiple PDB files of proteins and conduct some statistics on their contents.
Part 1
The first part consists of retrieving from the Protein Data bank one or more PDB files using Python libraries for web scrapping (Selenium, BeautifulSoup, ...)
Every record/entry in this database is represented by a 4-character unique identifier. A PDB file has a ".pdb" extension.
An example of a PDB file URL is presented below (3GWD being the identifier that changes between entries):
http://files.rcsb.org/view/3GWD.pdb
The program will start by prompting the user to enter a list of identifier(s) (separated by spaces) to retrieve from the database.
This part of the code must manage the below points:
EXPDTA and REMARK line with PROTEIN ATOMS greater than 0. The Other files should be discarded and deleted.import os
import re
import urllib.request
# Function to check if the ID is made of 4 alphanumeric characters
def valid(ID):
# Use a regular expression to check if the identifier has exactly 4 alphanumeric characters
return re.match("[a-zA-Z0-9]{4}$", ID) is not None
# Function to download locally the pdb files from the website
def download(identifier):
url = f"http://files.rcsb.org/view/{identifier}.pdb"
pdb_file = f"{identifier}.pdb"
# Download the PDB file if it doesn't exist locally
if not os.path.exists(pdb_file):
try:
with urllib.request.urlopen(url) as response, open(pdb_file, 'wb') as f: #open a file in write mode
data = response.read()
# Write the content to a local file
f.write(data)
except Exception :
with open(pdb_file, "w") as f:
f.write('not found') # If the ID is not found in the website, create a file and write 'not found' inside it
return pdb_file
# Function to check if the file exists in the web database
def found(pdb_file):
lines = [line.rstrip() for line in open(pdb_file,'r')] # Put the content of the file in a list
if lines[0] == 'not found':
return False
return True
# Function to check the criteria needed in the files
def check_structure(pdb_file):
with open(pdb_file, 'r') as file: # Opening the file in read mode
lines = file.readlines()
for line in lines:
if line.startswith("EXPDTA") and "X-RAY DIFFRACTION" not in line:
return False
if line.startswith("REMARK") and "PROTEIN ATOMS" in line:
protein_atoms = int(re.search(r': (\d+)', line).group(1)) # Find the protein atom number
if protein_atoms <= 0:
return False
return True
ids = input("Enter a list of identifier(s) separated by spaces: ").split()
valid_ids = []
invalid_ids = []
not_found_ids = []
for ID in ids:
if ID not in valid_ids:
if valid(ID):
pdb_file = download(ID) # If the ID is valid, download the file
if found(pdb_file):
if check_structure(pdb_file):
valid_ids.append(ID) # If the file is found, and structure is valid append the ID to the valid_ids list
else:
os.remove(pdb_file) # If the structure is not valid, disregard & delete the file
else:
not_found_ids.append(ID) # If the file is not found, append the ID to the not_found_ids list
else:
invalid_ids.append(ID) # If the ID is not valid, append the ID to the invalid_ids list
if invalid_ids:
print("Error: Invalid identifier(s):", ', '.join(invalid_ids)) # Print the invalid IDs
if not_found_ids:
print("Warning: Identifier(s) not found in the database:", ', '.join(not_found_ids)) # Print the IDs that are not found
if valid_ids:
print("Valid identifiers with protein structures determined by X-ray diffraction:", ', '.join(valid_ids)) # Print the valid IDs
Error: Invalid identifier(s): 23456 Warning: Identifier(s) not found in the database: 2345 Valid identifiers with protein structures determined by X-ray diffraction: 3GWD, 6BMW, 3BMW, 6GWD, 3HHH, 1BBB, 1CCC, 1EEE, 1FFF, 1GGG, 1HHH, 1III, 1KKK, 1MMM, 1NNN, 2AAA, 2BBB, 2CCC, 2DDD, 2FFF, 2GGG, 2HHH, 2III, 2JJJ, 2NNN
Part 2
The second part consists of computing some statistics about the content in amino acids of the protein files previously retrieved.
Note that there are many introductory lines and remarks at the top of the PDB file, as well as few lines at the end, which do not contain the amino acid information. The lines of interests are the one starting with the word ATOM and containing CA in the Atom name field.
20 different standard amino acids exist in proteins (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y), and these amino acids are divided into 4 categories according to their properties:
In the PDB file, amino acids are represented by their 3-letter abbreviation. Check the picture below for the various correspondences.
For a description regarding the PDB file format check this link below: http://www.wwpdb.org/documentation/file-format-content/format33/sect9.html#ATOM
The script starts by displaying the following menu on the screen:
- If the user selects this option, the script will calculate the combined frequency and percentage of each of the 20 amino acid in the different PDB files and print the results in a table sorted in reverse numerical order according to the percentage values. The table must contain 3 columns that correspond to the amino acid one-letter, its frequency and its percentage.
- If the user selects this option, the script will do the same as in (1), but the displayed results will be per amino acids category. The table will contain only 4 rows, each corresponding to an amino acid category.
- If the user selects this option, he/she will be prompted to choose one category from the four, and the script will display the frequency and percentage of each amino acid belonging to this category. In this case, the percentage of the amino acid is computed by dividing its frequency over the total number of amino acids belonging to the same category.
- If the user select this option, he/she will be prompted to enter the one-letter code of the amino acid of interest, and the script will just compute and print the percentage of that specific amino acid. Return an error message if the user enters an invalid one-letter code.
For the first three options (1, 2, and 3), the script will also return a plot to illustrate graphically the results of the frequency and percentage (2 y-axis scales can be used for each of the 2 metrics). You can use any data visualization library that is supported in Python Jupyter Notebooks (Matplotlib, Seaborn, Plotly …). An interactive plot will be a plus.
import plotly.graph_objects as go
from plotly.offline import plot
categories = {"Positively charged": ["R", "H", "K"],
"Negatively charged": ["D", "E"],
"Polar": ["N", "C", "Q", "S", "T", "Y"],
"Non-polar": ["A", "I", "L", "M", "F", "P", "W", "V", "G"]}
letter_code = {'ALA': 'A', 'CYS': 'C', 'ASP': 'D', 'GLU': 'E', 'PHE': 'F',
'GLY': 'G', 'HIS': 'H', 'ILE': 'I', 'LYS': 'K', 'LEU': 'L',
'MET': 'M', 'ASN': 'N', 'PRO': 'P', 'GLN': 'Q', 'ARG': 'R',
'SER': 'S', 'THR': 'T', 'VAL': 'V', 'TRP': 'W', 'TYR': 'Y'}
# Function that gives the list of amino acids in the pdb file in their 1-letter name
def atom_list(ID):
atoms = []
with open(f"{ID}.pdb", "r") as f: # Open the pdb file in read mode
for line in f:
if line[0:6].strip() == 'ATOM' and line[12:16].strip() == 'CA':
# If the line starts with 'ATOM' and contains 'CA' between the indexes 12 and 16, append the letter code found between index 17 and 20 to the list atom
# Convert from the 3 letter code to the 1 letter code using the dictionary letter_code
atoms.append(letter_code[line[17:20]])
return atoms
# Function to calculate frequency and percentage of amino acids
def stats(atoms):
total = len(atoms) # Number of amino acids in the pdb files
freq = {aa: atoms.count(aa) for aa in set(atoms)} # Count the frequency of every amino acid in the list atoms and save it in the dictionary freq
percentage = {aa: (count / total) * 100 for aa, count in freq.items()} # Calculate the percentage of every amino acid using its frequency from freq
return freq, percentage
# Function to sort in reverse numerical order (variation of the Bubble Sort algorithm)
def sort(dictionary):
sorted_items = list(dictionary.items()) # Convert the dictionary into a list
n = len(sorted_items)
for i in range(n):
for j in range(0, n-i-1):
if sorted_items[j][1] < sorted_items[j+1][1]:
sorted_items[j], sorted_items[j+1] = sorted_items[j+1], sorted_items[j]
return sorted_items
# Function to display statistics for all amino acids in a table
def display_all_stats(atoms):
freq, percentage = stats(atoms) # Calculate the frequency and percentage of the amino acids using the stats function
sorted_percentage = sort(percentage) # Sort the percentages in reverse numerical order using the sort function
# Print a table containing the one letter code of every amino acid, its frequency, and its percentage in all the pdb files
print("Amino Acid | Frequency | Percentage")
print("-------------------------------------")
for aa, percent in sorted_percentage:
print(f" {aa} | {freq[aa]:<8} | {percent:.2f}%")
generate_plots(freq, percentage, 1) # Generate a plot of the frequencies and percentages of all the amino acids
# Function to display statistics per amino acid category
def display_per_category_stats(atoms):
freq, percentage = stats(atoms) # Calculate the frequency and percentage of the amino acids using the stats function
category_freq = {category: sum(freq[aa] for aa in aas)
for category, aas in categories.items()} # Calculate the frequency of the amino acids according to their respective categories
total = sum(category_freq.values())
category_percentage = {category: (
count / total) * 100 for category, count in category_freq.items()} # Calculate the percentage of the amino acids according to their respective categories
# Print a table containing each amino acid category, its frequency, and its percentage
print("Category | Frequency | Percentage")
print("--------------------------------------------")
for category in categories:
print(
f"{category:20}| {category_freq[category]:<8}| {category_percentage[category]:>9.2f}%")
generate_plots(category_freq, category_percentage, 2) # Generate a plot of the frequencies and percentages of the amino acid categories
# Function to display statistics for amino acids within a category
def display_within_category_stats(atoms, category):
atoms_of_category = []
for atom in atoms:
if atom in categories[category]:
atoms_of_category.append(atom) # If the amino acid in the list belongs to the chosen category, append it to the atoms_of_category list
freq, percentage = stats(atoms_of_category) # Calculate the stats of the amino acids in atoms_of_category
sorted_percentage = sort(percentage) # Sort the percentages in reverse numerical order
# Print a table containing the one letter code of every amino acid in the chosen category, its frequency, and its percentage in all the pdb files
print(f"\nCategory: {category}")
print("Amino Acid | Frequency | Percentage")
print("-------------------------------------")
for aa, percent in sorted_percentage:
print(f" {aa} | {freq[aa]:<8} | {percent:.2f}%")
generate_plots(freq, percentage, 3) #Generate a plot of the frequencies and percentages of the amino acids within the chosen category
# Function to display statistics for a specific amino acid
def display_specific_aa_stats(atoms):
#Enter the amino acid 1 letter code and return an error if invalid
aa = input("Enter the one-letter code of the amino acid of interest: ").upper()
if aa not in letter_code.values():
print("Error: Invalid one-letter code.")
return
#Calculate the statistics of the chosen amino acid
freq, percentage = stats(atoms)
total = sum(freq.values())
aa_freq = freq.get(aa, 0)
aa_percentage = (aa_freq / total) * 100
#Print the statistics of the chosen amino acid
print(f"Amino Acid: {aa}")
print(f"Frequency: {aa_freq}")
print(f"Percentage: {aa_percentage:.2f}%")
# Function to generate an interactive plot to illustrate graphically the results of the frequency and percentage
def generate_plots(freq, percentage, option):
fig = go.Figure()
# Add bar trace for frequency
fig.add_trace(go.Bar(x=list(freq.keys()), y=list(freq.values()), name='Frequency', marker_color='pink'))
# Add scatter trace for percentage with its own y-axis
fig.add_trace(go.Scatter(x=list(percentage.keys()), y=list(percentage.values()), mode='lines+markers', name='Percentage', yaxis='y2', marker_color='purple'))
# Update layout with separate y-axes for the frequency and percentage
fig.update_layout(title='Amino Acid Frequency and Percentage', title_x=0.5,
yaxis=dict(title='Frequency', color='pink', side='left'),
yaxis2=dict(title='Percentage (%)', overlaying='y', side='right', color='purple'))
# Update the title of the xaxis according to the option chosen
if option == 1:
fig.update_layout(xaxis_title='Amino Acids')
plot(fig, filename='amino_acids.html', auto_open=False)
elif option == 2:
fig.update_layout(xaxis_title='Category of Amino Acids')
plot(fig, filename='amino_acids_by_category.html', auto_open=False)
elif option == 3:
fig.update_layout(xaxis_title=f"{category} Amino Acids")
plot(fig, filename=f"{category}_amino_acids.html", auto_open=False)
# Show plot
fig.show()
atoms = []
for ID in valid_ids:
atoms.extend(atom_list(ID))
# Print the options and enter the choice
print("\nMenu:\n1. ALL\n2. Per category\n3. Within category\n4. Specific AA")
choice = input("Enter your choice: ")
# Calculate and display the statistics according to the choice
if choice == '1':
display_all_stats(atoms)
elif choice == '2':
display_per_category_stats(atoms)
elif choice == '3':
print("\nCategories:\n- Positively charged\n- Negatively charged\n- Polar\n- Non-polar")
category = input("Enter your category: ").lower().capitalize()
display_within_category_stats(atoms, category)
elif choice == '4':
display_specific_aa_stats(atoms)
else:
print("Invalid choice. Please enter a number between 1 and 4.")
Menu: 1. ALL 2. Per category 3. Within category 4. Specific AA
Amino Acid | Frequency | Percentage
-------------------------------------
L | 1572 | 8.73%
A | 1564 | 8.68%
G | 1551 | 8.61%
V | 1324 | 7.35%
T | 1129 | 6.27%
R | 1125 | 6.25%
E | 1108 | 6.15%
D | 1040 | 5.77%
K | 993 | 5.51%
S | 972 | 5.40%
I | 928 | 5.15%
P | 806 | 4.48%
N | 715 | 3.97%
Q | 675 | 3.75%
Y | 660 | 3.66%
F | 638 | 3.54%
H | 418 | 2.32%
M | 319 | 1.77%
W | 295 | 1.64%
C | 177 | 0.98%
Part 3
For the last part, the program will compute the interatomic distance between the CA atom of the first and last standard amino acids for each of the protein PDB files previously retrieved from the database. The code will return the minimum, the maximum, the median, the mean and the standard deviation of the distance with a plot/graph to show the distribution of these interatomic distances.
The xyz coordinates of each atom in the PDB file are between positions 31 and 54 of each line starting with "ATOM". You can check the same link provided above for a description of the PDB file format.
import numpy as np
# Function that returns a list of the amino acid coordinates
def atom_coo_list(ID):
atoms_coo = []
with open(f"{ID}.pdb", 'r') as f:
for line in f:
if line[0:6].strip() == 'ATOM' and line[12:16].strip() == 'CA':
atoms_coo.append((float(line[30:38]), float(line[38:46]), float(line[46:54])))
# If the line starts with 'ATOM' and contains 'CA' between the indexes 12 and 16, append the coordinates x,y,z in float type in a tuple of the form (x,y,z)
return atoms_coo
# Function to calculate the interatomic distance
def compute_distance(coords1, coords2):
return np.sqrt((coords1[0] - coords2[0])**2 + (coords1[1] - coords2[1])**2 + (coords1[2] - coords2[2])**2)
# Method calculates interatomic distance = sqrt((x0 -x1)^2+(y0-y1)^2+(z0-z1)^2)
# Function that computes the interatomic distances between the 1st and last amino acid in each pdb file
# and that calculates the statistics, and shows them in a plot
def process_pdb_files(valid_ids):
all_distances = [] # Contains interatomic distances of each file
for ID in valid_ids:
atoms_coo = atom_coo_list(ID) # List of tuples which represent the coordinates of each atom in the file (as specified previously)
if len(atoms_coo) >= 2:
# If the number of atoms in the file is >=2, calculate the interatomic distances between the first and last atom in the file aka first and last atom in the list
distance = compute_distance(atoms_coo[0], atoms_coo[-1])
all_distances.append(distance)
stats = calculate_stats(all_distances)
generate_interactive_plot(all_distances, calculate_stats(all_distances)['Mean'], calculate_stats(all_distances)['Median'], calculate_stats(all_distances)['Standard Deviation'])
return stats
# Function that returns a dictionary containing the statistics
def calculate_stats(distances):
return {
'Minimum': np.min(distances),
'Maximum': np.max(distances),
'Median': np.median(distances),
'Mean': np.mean(distances),
'Standard Deviation': np.std(distances)
}
# Function to generate an interactive plot to illustrate graphically the results of the statistics
def generate_interactive_plot(distances, mean_dist, median_dist, std_dev):
fig = go.Figure()
# Add histogram for distance distribution
fig.add_trace(go.Histogram(x=distances, nbinsx=30, name='Distances', marker=dict(color='blue', line=dict(color='white', width=1))))
# Calculate the histogram bin data to find max frequency
fig_data = fig.data[0] # Access histogram data
hist_values, hist_edges = np.histogram(distances, bins=30) # Get frequencies and bins
max_frequency = max(hist_values) # Get the maximum frequency from the histogram
# Add lines for mean, median, and standard deviation, making them interactive
fig.add_trace(go.Scatter(x=[mean_dist,mean_dist], y=[0, max_frequency], mode="lines", name='Mean', line=dict(color='green', width=3, dash='dash')))
fig.add_trace(go.Scatter(x=[median_dist, median_dist], y=[0, max_frequency], mode="lines", name='Median', line=dict(color='orange', width=3, dash='dash')))
fig.add_trace(go.Scatter(x=[mean_dist - std_dev, mean_dist - std_dev], y=[0, max_frequency], mode="lines", name='Mean - 1 SD', line=dict(color='red', width=3, dash='dot')))
fig.add_trace(go.Scatter(x=[mean_dist + std_dev, mean_dist + std_dev], y=[0, max_frequency], mode="lines", name='Mean + 1 SD', line=dict(color='red', width=3, dash='dot')))
# Update layout with title, labels, and legend
fig.update_layout(
title='Distribution of Interatomic Distances (First vs. Last CA Atoms)',
xaxis_title='Interatomic Distance',
yaxis_title='Frequency',
showlegend=True, # Ensures that the legend is displayed
)
# Show the plot
fig.show()
plot(fig, filename='interatomic_distribution.html', auto_open=False)
statistics = process_pdb_files(valid_ids)
print("Statistics of Interatomic Distances:")
for key, value in statistics.items():
print(f"{key}: {value:.2f}")
Statistics of Interatomic Distances: Minimum: 5.54 Maximum: 182.22 Median: 45.13 Mean: 50.79 Standard Deviation: 37.97